Bayesian Non-Exhaustive Classification for Active Online Name Disambiguation

Authors

  • Baichuan Zhang
  • Murat Dundar
  • Mohammad Al Hasan
Abstract

The name disambiguation task partitions a collection of records pertaining to a given name, such that there is a one-to-one correspondence between the partitions and a group of people, all sharing that given name. Most existing solutions for this task are proposed for static data. However, more realistic scenarios involve records arriving in a streaming fashion, where records may belong to known as well as unknown persons, all sharing the same name. This requires a flexible name disambiguation algorithm that can not only classify records of known persons, who are represented in the training data by their existing records, but can also identify records of new ambiguous persons with no existing records in the initial training dataset. Toward achieving this objective, in this paper we propose a Bayesian non-exhaustive classification framework for solving online name disambiguation. In particular, we present a Dirichlet Process Gaussian Mixture Model (DPGMM) as the core engine for the online name disambiguation task. Two online inference algorithms, namely a one-pass Gibbs sampler and Sequential Importance Sampling with Resampling (also known as particle filtering), are proposed to simultaneously perform online classification and new class discovery. As a case study we consider bibliographic data in a temporal stream format and disambiguate authors by partitioning their papers into homogeneous groups. Our experimental results demonstrate that the proposed method is significantly better than existing methods at performing the online name disambiguation task. We also propose an interactive version of our online name disambiguation method designed to leverage user feedback to improve prediction accuracy.

ACM Reference format: Baichuan Zhang, Murat Dundar, and Mohammad Al Hasan. 2016. Bayesian Non-Exhaustive Classification for Active Online Name Disambiguation. In Proceedings of ACM Conference, Washington, DC, USA, July 2017 (Confer-
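
To make the idea of simultaneous online classification and new class discovery more concrete, the following is a minimal sketch and not the authors' algorithm. It assumes isotropic Gaussian classes with a known variance and a zero-mean Gaussian prior on class means, and it replaces the sampling step of a one-pass Gibbs sampler with a greedy choice of the highest-scoring class so that the result is deterministic. The class name `OnePassDPClassifier` and the parameters `alpha`, `obs_var`, and `prior_var` are hypothetical names chosen for illustration.

```python
import math


def log_gauss(x, mean, var):
    """Log density of an isotropic Gaussian with the given mean and variance."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (d * math.log(2 * math.pi * var) + sq / var)


class OnePassDPClassifier:
    """Greedy one-pass assignment under a Dirichlet process mixture (illustrative only)."""

    def __init__(self, alpha=1.0, obs_var=1.0, prior_var=25.0):
        self.alpha = alpha          # concentration: larger values favor new classes
        self.obs_var = obs_var      # assumed known within-class variance
        self.prior_var = prior_var  # variance of the zero-mean prior on class means
        self.sums = []              # per-class running sum of records
        self.counts = []            # per-class number of records

    def add_known(self, records):
        """Register a known class from its training records; returns its index."""
        d = len(records[0])
        self.sums.append([sum(r[j] for r in records) for j in range(d)])
        self.counts.append(len(records))
        return len(self.counts) - 1

    def _predictive(self, k):
        """Posterior predictive mean and variance for class k (conjugate normal update)."""
        n = self.counts[k]
        post_var = 1.0 / (1.0 / self.prior_var + n / self.obs_var)
        post_mean = [post_var * s / self.obs_var for s in self.sums[k]]
        return post_mean, post_var + self.obs_var

    def assign(self, x):
        """Assign record x to the best existing class or open a new one."""
        d = len(x)
        scores = []
        for k in range(len(self.counts)):
            mean, var = self._predictive(k)
            # Existing class: weight proportional to class size times predictive density.
            scores.append(math.log(self.counts[k]) + log_gauss(x, mean, var))
        # New class: weight proportional to alpha times the prior predictive density.
        scores.append(
            math.log(self.alpha)
            + log_gauss(x, [0.0] * d, self.prior_var + self.obs_var)
        )
        k = max(range(len(scores)), key=scores.__getitem__)
        if k == len(self.counts):   # new class discovered
            self.sums.append([0.0] * d)
            self.counts.append(0)
        self.sums[k] = [s + xi for s, xi in zip(self.sums[k], x)]
        self.counts[k] += 1
        return k
```

In this sketch, a streamed record that lies far from every known class scores higher under the new-class term and opens a new class, and later records near it are then assigned to that class. A sampler would instead draw the class index with probability proportional to the exponentiated scores.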

Related resources

Quantification of the relationships between volume and intensity of exhaustive treadmill running in active young men

ABSTRACT: Aim: Precisely quantifying the relationship between volume and intensity, the key components of training, is complicated for most coaches. The aim of this study was to quantify the inverse relationships between training volume and intensity during exhaustive treadmill running among active young men. Method and Material: 32 active young men aged 21 years selected as subjec...

Fast Author Name Disambiguation in CiteSeer

Name disambiguation can occur when one is seeking a list of publications of an author who has used different name variations and when there are multiple other authors with the same name. We present an efficient integrative machine learning framework for solving the name disambiguation problem: a blocking method retrieves candidate classes of authors with similar names and a clustering method, D...

Efficient Name Disambiguation for Large-Scale Databases

Name disambiguation can occur when one is seeking a list of publications of an author who has used different name variations and when there are multiple other authors with the same name. We present an efficient integrative framework for solving the name disambiguation problem: a blocking method retrieves candidate classes of authors with similar names and a clustering method, DBSCAN, clusters p...

Metadata for Name Disambiguation and Collocation

Searching names of persons, families, and organizations is often difficult in online databases because different persons or organizations frequently share the same name and because a single person’s or organization’s name may appear in different forms in various online documents. Databases and search engines can use metadata as a tool to solve the problem of name ambiguity and name variation in...

Merging error analysis of name disambiguation based on author similarity

Falsely identifying different authors as one is called merging error in the name disambiguation of coauthorship networks. Research on the measurement and distribution of merging errors helps to collect high quality coauthorship networks. In the aspect of measurement, we provide a Bayesian model to measure the errors through author similarity. We illustratively use the model and coauthor similar...

Journal title:
  • CoRR

Volume abs/1708.04531

Publication date 2017